Results 1 - 20 of 2,270
1.
Nature ; 622(7983): 594-602, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37821698

ABSTRACT

Metagenomes encode an enormous diversity of proteins, reflecting a multiplicity of functions and activities [1,2]. Exploration of this vast sequence space has been limited to a comparative analysis against reference microbial genomes and protein families derived from those genomes. Here, to examine the scale of yet untapped functional diversity beyond what is currently possible through the lens of reference genomes, we develop a computational approach to generate reference-free protein families from the sequence space in metagenomes. We analyse 26,931 metagenomes and identify 1.17 billion protein sequences longer than 35 amino acids with no similarity to any sequences from 102,491 reference genomes or the Pfam database [3]. Using massively parallel graph-based clustering, we group these proteins into 106,198 novel sequence clusters with more than 100 members, doubling the number of protein families obtained from the reference genomes clustered using the same approach. We annotate these families on the basis of their taxonomic, habitat, geographical and gene neighbourhood distributions and, where sufficient sequence diversity is available, predict protein three-dimensional models, revealing novel structures. Overall, our results uncover an enormously diverse functional space, highlighting the importance of further exploring the microbial functional dark matter.


Subject(s)
Metagenome , Metagenomics , Microbiology , Proteins , Cluster Analysis , Metagenome/genetics , Metagenomics/methods , Proteins/chemistry , Proteins/classification , Proteins/genetics , Databases, Protein , Protein Conformation
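
The abstract of entry 1 describes grouping sequences by graph-based clustering and keeping only clusters above a size threshold. The abstract does not specify the algorithm, so the following is only a minimal sketch that substitutes plain connected-components clustering (union-find) for the authors' method. The function name, the toy edge list and the `min_size` parameter are all illustrative assumptions.

```python
from collections import defaultdict


def cluster_by_similarity(n_items, similar_pairs, min_size):
    """Group items into connected components; keep those with >= min_size members."""
    parent = list(range(n_items))

    def find(x):
        # Path halving keeps the union-find trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each pair is an edge in the similarity graph (hypothetical input format).
    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for item in range(n_items):
        groups[find(item)].append(item)

    kept = [sorted(members) for members in groups.values() if len(members) >= min_size]
    return sorted(kept, key=lambda members: members[0])


# Toy similarity edges between 9 sequences, indexed 0..8 (made-up data).
pairs = [(0, 1), (1, 2), (3, 4), (5, 6), (6, 7), (7, 2)]
clusters = cluster_by_similarity(9, pairs, min_size=3)
```

With the abstract's criterion of "more than 100 members", the threshold in this sketch would be `min_size=101`.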
2.
Nature ; 622(7983): 646-653, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37704037

ABSTRACT

We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database [1]. These models cover nearly all proteins that are known, including those challenging to annotate for function or putative biological role using standard homology-based approaches. In this study, we examine the extent to which the AlphaFold database has structurally illuminated this 'dark matter' of the natural protein universe at high predicted accuracy. We further describe the protein diversity that these models cover as an annotated interactive sequence similarity network, accessible at https://uniprot3d.org/atlas/AFDB90v4. By searching for novelties from sequence, structure and semantic perspectives, we uncovered the β-flower fold, added several protein families to the Pfam database [2] and experimentally demonstrated that one of these belongs to a new superfamily of translation-targeting toxin-antitoxin systems, TumE-TumA. This work underscores the value of large-scale efforts in identifying, annotating and prioritizing new protein families. By leveraging the recent deep learning revolution in protein bioinformatics, we can now shed light on uncharted areas of the protein universe at an unprecedented scale, paving the way to innovations in life sciences and biotechnology.


Subject(s)
Databases, Protein , Deep Learning , Molecular Sequence Annotation , Protein Folding , Proteins , Structural Homology, Protein , Amino Acid Sequence , Internet , Proteins/chemistry , Proteins/classification , Proteins/metabolism
3.
Nature ; 622(7983): 637-645, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37704730

ABSTRACT

Proteins are key to all cellular processes and their structure is important in understanding their function and evolution. Sequence-based predictions of protein structures have increased in accuracy [1], and over 214 million predicted structures are available in the AlphaFold database [2]. However, studying protein structures at this scale requires highly efficient methods. Here, we developed a structural-alignment-based clustering algorithm, Foldseek cluster, that can cluster hundreds of millions of structures. Using this method, we have clustered all of the structures in the AlphaFold database, identifying 2.30 million non-singleton structural clusters, of which 31% lack annotations representing probable previously undescribed structures. Clusters without annotation tend to have few representatives covering only 4% of all proteins in the AlphaFold database. Evolutionary analysis suggests that most clusters are ancient in origin but 4% seem to be species specific, representing lower-quality predictions or examples of de novo gene birth. We also show how structural comparisons can be used to predict domain families and their relationships, identifying examples of remote structural similarity. On the basis of these analyses, we identify several examples of human immune-related proteins with putative remote homology in prokaryotic species, illustrating the value of this resource for studying protein function and evolution across the tree of life.


Subject(s)
Algorithms , Cluster Analysis , Proteins , Structural Homology, Protein , Humans , Databases, Protein , Proteins/chemistry , Proteins/classification , Proteins/metabolism , Sequence Alignment , Molecular Sequence Annotation , Prokaryotic Cells/chemistry , Phylogeny , Species Specificity , Evolution, Molecular
4.
Genome Biol ; 24(1): 135, 2023 06 08.
Article in English | MEDLINE | ID: mdl-37291671

ABSTRACT

BACKGROUND: In every living species, the function of a protein depends on its organization of structural domains, and the length of a protein is a direct reflection of this. Because every species evolved under different evolutionary pressures, the protein length distribution, much like other genomic features, is expected to vary across species but has so far been scarcely studied. RESULTS: Here we evaluate this diversity by comparing protein length distribution across 2326 species (1688 bacteria, 153 archaea, and 485 eukaryotes). We find that proteins tend to be on average slightly longer in eukaryotes than in bacteria or archaea, but that the variation of length distribution across species is low, especially compared to the variation of other genomic features (genome size, number of proteins, gene length, GC content, isoelectric points of proteins). Moreover, most cases of atypical protein length distribution appear to be due to artifactual gene annotation, suggesting the actual variation of protein length distribution across species is even smaller. CONCLUSIONS: These results open the way for developing a genome annotation quality metric based on protein length distribution to complement conventional quality measures. Overall, our findings show that protein length distribution between living species is more uniform than previously thought. Furthermore, we also provide evidence for a universal selection on protein length, yet its mechanism and fitness effect remain intriguing open questions.


Subject(s)
Molecular Sequence Annotation , Proteins , Sequence Analysis, Protein , Amino Acid Sequence , Molecular Sequence Annotation/methods , Proteins/chemistry , Proteins/classification , Proteome , Sequence Analysis, Protein/methods , Eukaryota , Bacteria , Archaea
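
Entry 4 compares protein length distributions across species. As a rough illustration of the kind of per-proteome summary such a comparison starts from, the sketch below computes the count, mean and median sequence length for a toy proteome. The dictionary layout, the function name and the sequences are made-up assumptions; the abstract does not describe the authors' statistics or the proposed quality metric in enough detail to reproduce them.

```python
from statistics import mean, median


def length_summary(proteome):
    """Return (count, mean length, median length) for a dict of id -> sequence."""
    lengths = [len(sequence) for sequence in proteome.values()]
    return len(lengths), mean(lengths), median(lengths)


# Toy proteome with three very short made-up sequences (illustration only).
toy_proteome = {"p1": "MKT", "p2": "MKTAY", "p3": "MKTAYIAKQ"}
count, mean_length, median_length = length_summary(toy_proteome)
```

Running the same summary over many proteomes and comparing the resulting values is one simple way to spot species whose distribution looks atypical.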
5.
J Biol Chem ; 298(10): 102435, 2022 10.
Article in English | MEDLINE | ID: mdl-36041629

ABSTRACT

Natural proteins are often only slightly more stable in the native state than the denatured state, and an increase in environmental temperature can easily shift the balance toward unfolding. Therefore, the engineering of proteins to improve protein stability is an area of intensive research. Thermostable proteins are required to withstand industrial process conditions, for increased shelf-life of protein therapeutics, for developing robust 'biobricks' for synthetic biology applications, and for research purposes (e.g., structure determination). In addition, thermostability buffers the often destabilizing effects of mutations introduced to improve other properties. Rational design approaches to engineering thermostability require structural information, but even with advanced computational methods, it is challenging to predict or parameterize all the relevant structural factors with sufficient precision to anticipate the results of a given mutation. Directed evolution is an alternative when structures are unavailable but requires extensive screening of mutant libraries. Recently, however, bioinspired approaches based on phylogenetic analyses have shown great promise. Leveraging the rapid expansion in sequence data and bioinformatic tools, ancestral sequence reconstruction can generate highly stable folds for novel applications in industrial chemistry, medicine, and synthetic biology. This review provides an overview of the factors important for successful inference of thermostable proteins by ancestral sequence reconstruction and what it can reveal about the determinants of stability in proteins.


Subject(s)
Directed Molecular Evolution , Enzymes , Protein Engineering , Proteins , Enzyme Stability , Phylogeny , Protein Engineering/methods , Protein Stability , Proteins/chemistry , Proteins/classification , Proteins/genetics , Temperature , Directed Molecular Evolution/methods , Enzymes/chemistry , Enzymes/classification , Enzymes/genetics
6.
Nucleic Acids Res ; 50(W1): W412-W419, 2022 07 05.
Article in English | MEDLINE | ID: mdl-35670671

ABSTRACT

Residue coevolution within and between proteins is used as a marker of physical interaction and/or residue functional cooperation. Pairs or groups of coevolving residues are extracted from multiple sequence alignments based on a variety of computational approaches. However, coevolution signals emerging in subsets of sequences might be lost if the full alignment is considered. iBIS2Analyzer is a web server dedicated to a phylogeny-driven coevolution analysis of protein families with different evolutionary pressure. It is based on the iterative version, iBIS2, of the coevolution analysis method BIS, Blocks in Sequences. iBIS2 is designed to iteratively select and analyse subtrees in phylogenetic trees, possibly large and comprising thousands of sequences. With iBIS2Analyzer, openly accessible at http://ibis2analyzer.lcqb.upmc.fr/, the user visualizes, compares and inspects clusters of coevolving residues by mapping them onto sequences, alignments or structures of choice, greatly simplifying downstream analysis steps. A rich and interactive graphic interface facilitates the biological interpretation of the results.


Subject(s)
Computers , Evolution, Molecular , Internet , Phylogeny , Proteins , Sequence Alignment , Software , Proteins/chemistry , Proteins/classification , Amino Acid Sequence , Data Visualization
7.
Nucleic Acids Res ; 50(D1): D1491-D1499, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34718741

ABSTRACT

As a crucial molecular mechanism, post-translational modifications (PTMs) play critical roles in a wide range of biological processes in plants. Recent advances in mass spectrometry-based proteomic technologies have greatly accelerated the profiling and quantification of plant PTM events. Although several databases have been constructed to store plant PTM data, a resource including more plant species and more PTM types with quantitative dynamics still remains to be developed. In this paper, we present an integrative database of quantitative PTMs in plants named qPTMplants (http://qptmplants.omicsbio.info), which hosts 1 242 365 experimentally identified PTM events for 429 821 nonredundant sites on 123 551 proteins under 583 conditions for 23 PTM types in 43 plant species from 293 published studies, with 620 509 quantification events for 136 700 PTM sites on 55 361 proteins under 354 conditions. Moreover, the experimental details, such as conditions, samples, instruments and methods, were manually curated, while a variety of annotations, including the sequence and structural characteristics, were integrated into qPTMplants. Then, various search and browse functions were implemented to access the qPTMplants data in a user-friendly manner. Overall, we anticipate that the qPTMplants database will be a valuable resource for further research on PTMs in plants.


Subject(s)
Databases, Protein , Plants/genetics , Protein Processing, Post-Translational/genetics , Proteins/genetics , Plants/classification , Proteins/classification , Proteomics/standards
8.
Nucleic Acids Res ; 50(D1): D1541-D1552, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34791421

ABSTRACT

ProteomicsDB (https://www.ProteomicsDB.org) is a multi-omics and multi-organism resource for life science research. In this update, we present our efforts to continuously develop and expand ProteomicsDB. The major focus over the last two years was improving the findability, accessibility, interoperability and reusability (FAIR) of the data as well as its implementation. For this purpose, we release a new application programming interface (API) that provides systematic access to essentially all data in ProteomicsDB. Second, we release a new open-source user interface (UI) and show the advantages the scientific community gains from such software. With the new interface, two new visualizations of protein primary, secondary and tertiary structure as well as an updated spectrum viewer were added. Furthermore, we integrated ProteomicsDB with our deep-neural-network Prosit that can predict the fragmentation characteristics and retention time of peptides. The result is an automatic processing pipeline that can be used to reevaluate database search engine results stored in ProteomicsDB. In addition, we extended the data content with experiments investigating different human biology as well as a newly supported organism.


Subject(s)
Databases, Protein , Proteins/classification , Proteomics/classification , Software , Biological Science Disciplines , Humans , Neural Networks, Computer , Proteins/chemistry
9.
Nucleic Acids Res ; 50(D1): D560-D570, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34664670

ABSTRACT

The success of protein engineering and design has extensively expanded the protein space, which presents a promising strategy for creating next-generation proteins of diverse functions. Among these proteins, the synthetic binding proteins (SBPs) are smaller, more stable, less immunogenic, and better at tissue penetration than others, which has made SBP-related data of wide interest to scientists worldwide. However, no database has yet been developed to systematically provide information on SBPs. In this study, a database named 'Synthetic Binding Proteins for Research, Diagnosis, and Therapy (SYNBIP)' was thus introduced. This database is unique in (a) comprehensively describing thousands of SBPs from the perspectives of scaffolds, biophysical & functional properties, etc.; (b) panoramically illustrating the binding targets & the broad application of each SBP and (c) enabling a similarity search against the sequences of all SBPs and their binding targets. Since SBPs are human-made proteins that have not been found in nature, the discovery of novel SBPs has relied heavily on experimental protein engineering and could be greatly facilitated by in-silico studies (such as AI and computational modeling). Thus, the data provided in SYNBIP could lay a solid foundation for the future development of novel SBPs. SYNBIP is accessible without login requirement at both official (https://idrblab.org/synbip/) and mirror (http://synbip.idrblab.net/) sites.


Subject(s)
Bacterial Proteins/classification , Carrier Proteins/genetics , Databases, Protein , Proteins/classification , Bacterial Proteins/chemistry , Carrier Proteins/classification , Computer Simulation , Humans , Protein Conformation , Protein Engineering/trends , Proteins/chemistry
10.
Nucleic Acids Res ; 50(D1): D587-D595, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34850110

ABSTRACT

Molecular interactions are key drivers of biological function. Providing interaction resources to the research community is important since they allow functional interpretation and network-based analysis of molecular data. ConsensusPathDB (http://consensuspathdb.org) is a meta-database combining interactions of diverse types from 31 public resources for humans, 16 for mice and 14 for yeasts. Using ConsensusPathDB, researchers commonly evaluate lists of genes, proteins and metabolites against sets of molecular interactions defined by pathways, Gene Ontology and network neighborhoods and retrieve complex molecular neighborhoods formed by heterogeneous interaction types. Furthermore, the integrated protein-protein interaction network is used as a basis for propagation methods. Here, we present the 2022 update of ConsensusPathDB, highlighting content growth, additional functionality and improved database stability. For example, the number of human molecular interactions increased to 859 848 connecting 200 499 unique physical entities such as genes/proteins, metabolites and drugs. Furthermore, we integrated regulatory datasets in the form of transcription factor-, microRNA- and enhancer-gene target interactions, thus providing novel functionality in the context of overrepresentation and enrichment analyses. We specifically emphasize the use of the integrated protein-protein interaction network as a scaffold for network inferences, present topological characteristics of the network and discuss strengths and shortcomings of such approaches.


Subject(s)
Databases, Genetic , Protein Interaction Maps/genetics , Proteins/genetics , Software , Animals , Computational Biology/trends , Gene Ontology/trends , Gene Regulatory Networks/genetics , Humans , Mice , MicroRNAs/classification , MicroRNAs/genetics , Proteins/classification , User-Computer Interface
11.
Nucleic Acids Res ; 50(D1): D553-D559, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34850923

ABSTRACT

The Structural Classification of Proteins-extended (SCOPe, https://scop.berkeley.edu) knowledgebase aims to provide an accurate, detailed, and comprehensive description of the structural and evolutionary relationships amongst the majority of proteins of known structure, along with resources for analyzing the protein structures and their sequences. Structures from the PDB are divided into domains and classified using a combination of manual curation and highly precise automated methods. In the current release of SCOPe, 2.08, we have developed search and display tools for analysis of genetic variants we mapped to structures classified in SCOPe. In order to improve the utility of SCOPe to automated methods such as deep learning classifiers that rely on multiple alignment of sequences of homologous proteins, we have introduced new machine-parseable annotations that indicate aberrant structures as well as domains that are distinguished by a smaller repeat unit. We also classified structures from 74 of the largest Pfam families not previously classified in SCOPe, and we improved our algorithm to remove N- and C-terminal cloning, expression and purification sequences from SCOPe domains. SCOPe 2.08-stable classifies 106 976 PDB entries (about 60% of PDB entries).


Subject(s)
Computational Biology , Databases, Protein , Proteins/classification , Algorithms , Databases, Chemical , Gene Expression Regulation/genetics , Machine Learning , Proteins/genetics
12.
Nucleic Acids Res ; 50(D1): D1528-D1534, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34606614

ABSTRACT

Protein-nucleic acid interactions are involved in various biological processes such as gene expression, replication, transcription, translation and packaging. The binding affinities of protein-DNA and protein-RNA complexes are important for elucidating the mechanism of protein-nucleic acid recognition. Although experimental data on binding affinity are reported abundantly in the literature, no well-curated database is currently available for protein-nucleic acid binding affinity. We have developed a database, ProNAB, which contains more than 20 000 experimental data points for the binding affinities of protein-DNA and protein-RNA complexes. Each entry provides comprehensive information on sequence and structural features of a protein, nucleic acid and its complex, experimental conditions, thermodynamic parameters such as dissociation constant (Kd), binding free energy (ΔG) and change in binding free energy upon mutation (ΔΔG), and literature information. ProNAB is cross-linked with GenBank, UniProt, PDB, ProThermDB, PROSITE, DisProt and PubMed. It provides a user-friendly web interface with options to search, display, sort, visualize, download and upload the data. ProNAB is freely available at https://web.iitm.ac.in/bioinfo2/pronab/ and has potential applications such as understanding the factors influencing affinity, developing prediction tools, analysing binding affinity change upon mutation and designing complexes with the desired affinity.


Subject(s)
Databases, Protein , Macromolecular Substances/classification , Nucleic Acids/genetics , Proteins/genetics , DNA-Binding Proteins/genetics , DNA-Binding Proteins/ultrastructure , Macromolecular Substances/chemistry , Macromolecular Substances/ultrastructure , Mutation/genetics , Nucleic Acids/ultrastructure , Protein Binding/genetics , Proteins/classification
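
Entry 12 lists the dissociation constant (Kd), the binding free energy (ΔG) and its change upon mutation (ΔΔG) as stored parameters. These quantities are linked by the standard thermodynamic relation ΔG = RT ln(Kd), with Kd expressed relative to a 1 M standard state. The sketch below illustrates that relation. The function names, the default temperature of 298.15 K and the sign convention ΔΔG = ΔG(mutant) − ΔG(wild type) are assumptions made for illustration; ProNAB's own conventions are not stated in the abstract.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Binding free energy in kcal/mol from a dissociation constant in mol/L."""
    return R_KCAL * temperature_k * math.log(kd_molar)


def delta_delta_g(kd_mutant, kd_wild_type, temperature_k=298.15):
    """Change in binding free energy upon mutation (mutant minus wild type)."""
    return (binding_free_energy(kd_mutant, temperature_k)
            - binding_free_energy(kd_wild_type, temperature_k))


dg = binding_free_energy(1e-9)   # a 1 nM binder, roughly -12.3 kcal/mol
ddg = delta_delta_g(1e-8, 1e-9)  # a 10-fold weaker mutant, roughly +1.36 kcal/mol
```

Under this convention a positive ΔΔG means the mutation weakens binding; some sources use the opposite sign, so it is worth checking the convention of any dataset before comparing values.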
13.
Nucleic Acids Res ; 50(D1): D54-D59, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34755885

ABSTRACT

APPRIS (https://appris.bioinfo.cnio.es) is a well-established database housing annotations for protein isoforms for a range of species. APPRIS selects principal isoforms based on protein structure and function features and on cross-species conservation. Most coding genes produce a single main protein isoform and the principal isoforms chosen by the APPRIS database best represent this main cellular isoform. Human genetic data, experimental protein evidence and the distribution of clinical variants all support the relevance of APPRIS principal isoforms. APPRIS annotations and principal isoforms have now been expanded to 10 model organisms. In this paper we highlight the most recent updates to the database. APPRIS annotations have been generated for two new species, cow and chicken, the protein structural information has been augmented with reliable models from the EMBL-EBI AlphaFold database, and we have substantially expanded the confirmatory proteomics evidence available for the human genome. The most significant change in APPRIS has been the implementation of TRIFID functional isoform scores. TRIFID functional scores are assigned to all splice isoforms, and APPRIS uses the TRIFID functional scores and proteomics evidence to determine principal isoforms when core methods cannot.


Subject(s)
Databases, Protein , Protein Isoforms/genetics , Proteins/genetics , Proteomics , Animals , Cattle , Chickens/genetics , Humans , Protein Conformation , Protein Isoforms/classification , Proteins/chemistry , Proteins/classification
14.
Proteins ; 90(1): 110-122, 2022 01.
Article in English | MEDLINE | ID: mdl-34322903

ABSTRACT

Protein β-turn classification remains an area of ongoing development in structural biology research. While the commonly used nomenclature defining type I, type II and type IV β-turns was introduced in the 1970s and 1980s, refinements of β-turn type definitions have been introduced as recently as 2019 by Dunbrack, Jr and co-workers who expanded the number of β-turn types to 18 (Shapovalov et al, PLoS Comput. Biol., 15, e1006844, 2019). Based on their analysis of 13 030 turns from 1074 ultrahigh resolution (≤1.2 Å) protein structures, they used a new clustering algorithm to expand the definitions used to classify protein β-turns and introduced a new nomenclature system. We recently encountered a specific problem when classifying β-turns in crystal structures of pentapeptide repeat proteins (PRPs) determined in our lab that are largely composed of β-turns that often lie close to, but just outside of, canonical β-turn regions. To address this problem, we devised a new scheme that merges the Klyne-Prelog stereochemistry nomenclature and definitions with the Ramachandran plot. The resulting Klyne-Prelog-modified Ramachandran plot scheme defines 1296 distinct potential β-turn classifications that cover all possible protein β-turn space with a nomenclature that indicates the stereochemistry of i + 1 and i + 2 backbone dihedral angles. The utility of the new classification scheme was illustrated by re-classification of the β-turns in all known protein structures in the PRP superfamily and further assessed using a database of 16 657 high-resolution protein structures (≤1.5 Å) from which 522 776 β-turns were identified and classified.


Subject(s)
Protein Conformation , Proteins , Algorithms , Amino Acid Sequence , Cluster Analysis , Crystallography , Hydrogen Bonding , Models, Molecular , Proteins/chemistry , Proteins/classification , Proteins/metabolism , Stereoisomerism
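
The figure of 1296 classes in entry 14 is consistent with assigning each of the four backbone dihedrals (φ and ψ of residues i + 1 and i + 2) to one of the six standard Klyne-Prelog torsion regions, since 6^4 = 1296. The sketch below illustrates that counting and a possible labelling function. Whether the authors partition the angles exactly this way, how they treat region boundaries, and what their labels look like are assumptions; only the six Klyne-Prelog regions themselves are standard.

```python
from itertools import product

# Standard Klyne-Prelog torsion regions, each 60 degrees wide.
REGIONS = ("sp", "+sc", "+ac", "ap", "-ac", "-sc")


def klyne_prelog_region(angle_deg):
    """Map a torsion angle in degrees to a Klyne-Prelog region label."""
    a = (angle_deg + 180.0) % 360.0 - 180.0  # wrap into [-180, 180)
    if -30.0 <= a < 30.0:
        return "sp"   # synperiplanar
    if 30.0 <= a < 90.0:
        return "+sc"  # +synclinal
    if 90.0 <= a < 150.0:
        return "+ac"  # +anticlinal
    if -90.0 <= a < -30.0:
        return "-sc"  # -synclinal
    if -150.0 <= a < -90.0:
        return "-ac"  # -anticlinal
    return "ap"       # antiperiplanar


def classify_turn(phi1, psi1, phi2, psi2):
    """Label a turn by the regions of its i+1 and i+2 dihedrals (hypothetical format)."""
    return tuple(klyne_prelog_region(x) for x in (phi1, psi1, phi2, psi2))


all_classes = list(product(REGIONS, repeat=4))      # every possible 4-region label
example = classify_turn(-60.0, -20.0, -100.0, 10.0)  # made-up illustrative angles
```

Because every angle falls in exactly one region, any turn receives a label, which matches the abstract's statement that the scheme covers all possible β-turn space.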
15.
World J Microbiol Biotechnol ; 38(1): 8, 2021 Nov 27.
Article in English | MEDLINE | ID: mdl-34837551

ABSTRACT

Microalgae are potential feedstocks for the commercial production of carotenoids; however, the metabolic pathways for carotenoid biosynthesis across algal lineages are largely unexplored. This work is the first to provide a comprehensive survey of genes and enzymes associated with the less studied methylerythritol 4-phosphate/1-deoxy-D-xylulose 5-phosphate pathway as well as the carotenoid biosynthetic pathway in microalgae through a bioinformatics and comparative genomics approach. Candidate genes/enzymes were subsequently analyzed across 22 microalgae species of lineages Chlorophyta, Rhodophyta, Heterokonta, Haptophyta, Cryptophyta, and known Arabidopsis homologs in order to study the evolutionary divergence in terms of sequence-structure properties. A total of 403 enzymes playing a vital role in the biosynthesis of carotene, lutein, zeaxanthin, violaxanthin, canthaxanthin, and astaxanthin were identified. Of these, 85 were hypothetical proteins whose biological roles are not yet experimentally characterized. Putative functions were successfully assigned to these hypothetical proteins through a comprehensive investigation of the protein family, motifs, intrinsic physicochemical features, subcellular localization, pathway analysis, etc. Furthermore, these enzymes were categorized into major classes as per the conserved domain and gene ontology. Functional signature sequences were also identified and found to be conserved across microalgal genomes. Additionally, structural modeling and active-site architecture analysis were carried out for three vital enzymes, DXR, PSY, and ZDS, which catalyze the rate-limiting steps in Dunaliella salina. The enzymes were confirmed to be stereochemically reliable and stable as revealed during molecular dynamics simulation of 100 ns. The detailed functional information about individual vital enzymes will certainly help to design genetically modified algal strains with enhanced carotenoid contents.


Subject(s)
Carotenoids/metabolism , Genomics/methods , Microalgae/enzymology , Proteins/genetics , Biosynthetic Pathways , Catalytic Domain , Computational Biology , Data Mining , Evolution, Molecular , Gene Ontology , Microalgae/classification , Microalgae/metabolism , Models, Molecular , Protein Conformation , Protein Domains , Proteins/chemistry , Proteins/classification , Proteins/metabolism
16.
Nat Commun ; 12(1): 5800, 2021 10 04.
Article in English | MEDLINE | ID: mdl-34608136

ABSTRACT

Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between 10^2 and 10^3). Furthermore, the simple structure of our models has distinctive mathematical advantages, which translate into an improved applicability in sequence generation and evaluation. Within these models, we can easily estimate both the probability of a given sequence, and, using the model's entropy, the size of the functional sequence space related to a specific protein family. In the example of response regulators, we find a huge number of ca. 10^68 possible sequences, which nevertheless constitute only the astronomically small fraction 10^-80 of all amino-acid sequences of the same length. These findings illustrate the potential and the difficulty in exploring sequence space via generative sequence models.


Subject(s)
Models, Statistical , Proteins/chemistry , Amino Acid Sequence , Computational Biology , Databases, Protein , Epistasis, Genetic , Evolution, Molecular , Machine Learning , Mutation , Proteins/classification , Proteins/genetics , Sequence Alignment
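
Entry 16 rests on two ideas that are easy to illustrate. An autoregressive model factorises the probability of a sequence by the chain rule, and the size of the functional sequence space can be compared with the total number of sequences of the same length. The sketch below shows both under stated assumptions. The uniform conditional is a placeholder rather than the authors' trained model, and the alphabet size of 21 (20 amino acids plus a gap symbol) and aligned length of 112 are assumed values not given in the abstract, chosen because they reproduce the quoted fraction of about 10^-80.

```python
import math


def sequence_log_prob(seq, conditional):
    """Chain rule: log P(a_1..a_L) = sum over i of log P(a_i | a_1..a_{i-1})."""
    return sum(math.log(conditional(seq[:i], seq[i])) for i in range(len(seq)))


def uniform_conditional(prefix, symbol, alphabet_size=21):
    # Placeholder conditional that ignores the prefix (illustration only).
    return 1.0 / alphabet_size


def log10_fraction(log10_functional, length, alphabet_size):
    """log10 of (functional sequences) / (all sequences of the given length)."""
    return log10_functional - length * math.log10(alphabet_size)


toy_log_prob = sequence_log_prob("ACD", uniform_conditional)
# 10^68 functional sequences out of 21^112 (assumed alphabet size and length).
fraction_exponent = log10_fraction(68, length=112, alphabet_size=21)
```

With these assumed values the exponent comes out close to -80, in line with the fraction quoted in the abstract.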
17.
PLoS One ; 16(10): e0258625, 2021.
Article in English | MEDLINE | ID: mdl-34669708

ABSTRACT

Although genes carry information, proteins are the main players in providing the functionalities of a living organism. Large numbers of different proteins are involved in every function that occurs in a cell. These amino acid sequences can be hierarchically classified into a set of families and subfamilies depending on their evolutionary relatedness and similarities in their structure or function. Protein characterization to identify protein structure and function is done accurately using laboratory experiments. With the rapidly increasing number of novel protein sequences, these experiments have become difficult to carry out since they are expensive, time-consuming, and laborious. Therefore, many computational classification methods have been introduced to classify proteins and predict their functional properties. As the performance of computational techniques has progressed, deep learning has come to play a key role in many areas. Novel deep learning models such as DeepFam and ProtCNN have recently been presented to classify proteins into their families. However, these deep learning models have been used to carry out the non-hierarchical classification of proteins. In this research, we propose a deep learning neural network model named DeepHiFam with high accuracy to classify proteins hierarchically into different levels simultaneously. The model achieved an accuracy of 98.38% for protein family classification and more than 80% accuracy for the classification of protein subfamilies and sub-subfamilies. Further, DeepHiFam performed well in the non-hierarchical classification of protein families and achieved an accuracy of 98.62% and 96.14% for the popular Pfam dataset and COG dataset respectively.


Subject(s)
Computational Biology/methods , Proteins/classification , Deep Learning , Humans , Models, Genetic , Protein Conformation , Proteins/chemistry , Proteins/genetics , Proteins/metabolism , Sequence Analysis, Protein
18.
Comput Math Methods Med ; 2021: 5770981, 2021.
Article in English | MEDLINE | ID: mdl-34413898

ABSTRACT

Antioxidant proteins (AOPs) play important roles in the management and prevention of several human diseases due to their ability to neutralize excess free radicals. However, the identification of AOPs by using wet-lab experimental techniques is often time-consuming and expensive. In this study, we proposed an accurate computational model, called AOP-HMM, to predict AOPs by extracting discriminatory evolutionary features from hidden Markov model (HMM) profiles. First, auto cross-covariance (ACC) variables were applied to transform the HMM profiles into fixed-length feature vectors. Then, we performed the analysis of variance (ANOVA) method to reduce the dimensionality of the raw feature space. Finally, a support vector machine (SVM) classifier was adopted to conduct the prediction of AOPs. To comprehensively evaluate the performance of the proposed AOP-HMM model, the 10-fold cross-validation (CV), the jackknife CV, and the independent test were carried out on two widely used benchmark datasets. The experimental results demonstrated that AOP-HMM outperformed most of the existing methods and could be used to quickly annotate AOPs and guide the experimental process.


Subject(s)
Antioxidants/chemistry , Machine Learning , Peroxiredoxins/chemistry , Proteins/chemistry , Algorithms , Amino Acids/analysis , Antioxidants/classification , Computational Biology , Databases, Protein/statistics & numerical data , Evolution, Molecular , Humans , Markov Chains , Peroxiredoxins/classification , Proteins/classification
19.
Proteins ; 89(11): 1541-1556, 2021 Nov.
Article in English | MEDLINE | ID: mdl-34245187

ABSTRACT

The growing number of three-dimensional protein structures and enhanced computing power have significantly facilitated our understanding of protein sequence/structure/function relationships. A challenge in structural genomics is to predict the function of uncharacterized proteins. Protein function deconvolution based on global sequence or structural homology is impracticable when a protein is related to no other protein of known function; in such cases, functional relationships can be established by detecting local ligand binding site similarity. Here, we introduce a sequence order-independent comparison algorithm, PocketShape, for structural proteome-wide exploration of protein functional sites by fully considering the geometry of the backbones, the orientation of the sidechains, and the physicochemical properties of the pocket-lining residues. PocketShape is efficient in distinguishing similar from dissimilar ligand binding site pairs, retrieving 99.3% of the similar pairs while rejecting 100% of the dissimilar pairs on a dataset containing 1538 binding site pairs. The method successfully classifies 83 enzyme structures with diverse functions into 12 clusters, in close agreement with the Structural Classification of Proteins classification. PocketShape also outperforms other methods in protein profiling based on experimental data. Potential new applications for the representative SARS-CoV-2 drugs Remdesivir and 11a are predicted. The high accuracy and time efficiency of PocketShape make it a promising complementary tool for proteome-wide protein function inference and drug repurposing studies.


Subject(s)
Algorithms , Antiviral Agents/pharmacology , Drug Repositioning/methods , Proteins/metabolism , Adenosine Monophosphate/analogs & derivatives , Adenosine Monophosphate/chemistry , Adenosine Monophosphate/metabolism , Adenosine Monophosphate/pharmacology , Alanine/analogs & derivatives , Alanine/chemistry , Alanine/metabolism , Alanine/pharmacology , Antiviral Agents/chemistry , Binding Sites , Coronavirus 3C Proteases/chemistry , Coronavirus 3C Proteases/metabolism , Databases, Protein , GTP Phosphohydrolases/chemistry , GTP Phosphohydrolases/metabolism , Phosphoglycerate Mutase/chemistry , Phosphoglycerate Mutase/metabolism , Proteins/chemistry , Proteins/classification , ROC Curve , SARS-CoV-2/drug effects
20.
Biomed Res Int ; 2021: 5574789, 2021.
Article in English | MEDLINE | ID: mdl-34046497

ABSTRACT

Cytochrome (CYP) enzymes catalyze the metabolic reactions of endogenous and exogenous compounds. The superfamily of enzymes is found across many organisms, regardless of type, except for plants. Information about CYP2D enzymes was gathered from protein sequences of humans and other organisms. The secondary structure was predicted using SOPMA. The structural and functional study of human CYP2D was conducted using ProtParam, SOPMA, Predotar 1.03, SignalP, TMHMM 2.0, and ExPASy. According to the motif analysis, most animals shared five central motifs. The tertiary structure of CYP2D from humans and other animal species was predicted with Phyre2. According to the findings, human CYP2D proteins are heavily conserved across organisms, which indicates that they are descended from a single ancestor. The ratio of alpha-helices to extended strands to beta sheets to random coils was calculated. The enzymes are predominantly alpha-helical, with smaller amounts of random coil. The data were obtained to provide a better understanding of the functions and evolutionary relationships of mammalian proteins.
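
The secondary-structure ratio mentioned above amounts to counting per-residue states. As a rough sketch, assuming the prediction is available as a string with one state letter per residue, the fractions can be computed as follows. The function name `state_fractions` and the one-letter codes in the example (`h`, `e`, `t`, `c`) are hypothetical and are not taken from the study or from any particular tool's output format.

```python
from collections import Counter
from typing import Dict


def state_fractions(states: str) -> Dict[str, float]:
    """Return the fraction of residues assigned to each state letter."""
    if not states:
        raise ValueError("empty state string")
    counts = Counter(states)
    # Sorted keys give a stable, readable ordering of the states.
    return {state: counts[state] / len(states) for state in sorted(counts)}


# Hypothetical 10-residue prediction: 4 helix, 2 coil, 2 strand, 2 turn.
fractions = state_fractions("hhhhcceett")
for state, value in fractions.items():
    print(f"{state}: {value:.0%}")
```

Comparing these fractions across species is one simple way to check whether a protein family is predominantly helical.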


Subject(s)
Cytochromes/chemistry , Cytochromes/classification , Phylogeny , Proteins/chemistry , Amino Acid Sequence , Animals , Computational Biology/methods , Computer Simulation , Cytochrome P-450 Enzyme System/chemistry , Cytochrome P-450 Enzyme System/classification , Cytochrome P-450 Enzyme System/genetics , Cytochrome P-450 Enzyme System/metabolism , Cytochromes/genetics , Cytochromes/metabolism , Humans , Ligands , Mice , Models, Molecular , Protein Conformation, alpha-Helical , Protein Interaction Domains and Motifs , Protein Structure, Secondary , Proteins/classification , Proteins/genetics , Sequence Alignment , Software
...